A Procedural Definition of Multi-word Lexical Units
Abstract
Multi-word expressions evade a closed definition. Linguists and computational linguists rely on intuition or build lists of MWE types; while practical, that is scientifically and aesthetically unsatisfying. Without presuming to solve a daunting theoretical problem, we propose a decision procedure which steers a lexicographer toward acceptance or rejection of an N-gram as a lexical unit: a decision tree classifies N-grams as MWE or not MWE. It will succeed if it agrees with the native speakers’ judgment. We need a small, linguistically credible set of features, to contend with the multiplicity of adequate trees. Decision tree induction works with a fixed set of annotated classification examples, but the lexical material for MWE recognition is too large to make annotation feasible. We rely on small-scale statistically significant sampling, and on intuition. Of a few decision trees produced by informed trial and error, we select one we consider best in our circumstances. That tree, deployed in a large-scale wordnet construction project, allowed us to gather dependable statistics on its usefulness in lexicographers’ work. Our goal: systematic expansion of a wordnet by tens of thousands of MWEs in a manner as free of personal biases as possible.
1 Motivation
Multi-word expressions (MWEs) are present in almost every lexical resource. Their recognition can facilitate many natural language engineering tasks: information extraction, automated indexing, question answering and machine translation, to name a few. The unwavering interest in MWEs contends with the vagueness of the notion itself. There are too many, and too divergent, descriptions of just what an MWE is. Computational linguists have sought – with mixed success – a clear, “closed-formula” definition.
It turns out that not only is the term “multi-word expression” absent from the linguistic literature, but there is also no consensus on fixed phraseological expressions, non-compositional expressions, idiomatic expressions, lexicalised expressions, collocations and so on. Most sources in traditional and computational linguistics alike seem to make do with a list of types of lexical connections in lieu of a definition. That may be practical, but it is neither scientifically nor aesthetically satisfying.
Piasecki et al. (2009) and Maziarz et al. (2013) present plWordNet, a very large wordnet and a comprehensive lexical resource for Polish. It describes most Polish single-word lexical units and many multi-word expressions, but the coverage of the latter must increase significantly. Before that can happen, one needs to decide which MWEs merit inclusion in plWordNet, and how to make a group of lexicographers apply the definition consistently when they work on wordnet expansion.
We aim to develop a decision procedure which steers a lexicographer toward unequivocal acceptance or rejection of an N-gram as a unit in the lexical system of the language at hand. Just as a formal grammar, by setting precise boundaries, inevitably includes some things that are intuitively ungrammatical and excludes some that are intuitively grammatical, an MWE decision procedure cannot be perfect. It will be a success if it agrees to a high degree with the